Reducing I/O Cost in OLAP Query Processing with MapReduce

Authors

  • Woo-Lam Kang
  • Hyeon Gyu Kim
  • Yoon-Joon Lee
Abstract

This paper presents a method to reduce I/O cost in MapReduce when online analytical processing (OLAP) queries are used for data analysis. The proposed method rests on two basic ideas. First, to reduce network transmission cost, mappers are organized to receive only the data necessary to perform a map task, not the entire set of input data. Second, to reduce storage consumption, only record IDs are stored for checkpointing, not the raw records. Experiments conducted with the TPC-H benchmark show that the proposed method is about 40% faster than Hive, the well-known data warehouse solution for MapReduce, while reducing the size of data stored for checkpointing to about 80%.

Key words: MapReduce, Hadoop, OLAP, data warehouse, TPC-H benchmark
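The two ideas in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation and it does not use Hadoop or Hive; the in-memory `RECORDS` table, its column names, and the helper functions `project`, `map_filter`, `checkpoint_ids`, `checkpoint_raw`, and `restore` are all hypothetical names chosen for illustration. It only shows, under those assumptions, why passing projected columns to a map step and checkpointing record IDs instead of raw records can shrink the data that is moved and stored.

```python
import json

# Hypothetical in-memory table keyed by record ID (an assumption for
# illustration; a real system would read from distributed storage).
RECORDS = {
    1: {"orderkey": 1, "price": 100.0, "comment": "x" * 50},
    2: {"orderkey": 2, "price": 250.0, "comment": "y" * 50},
    3: {"orderkey": 3, "price": 300.0, "comment": "z" * 50},
}


def project(record, columns):
    """Idea 1: keep only the columns that the map task actually needs."""
    return {c: record[c] for c in columns}


def map_filter(records, columns, predicate):
    """Return (record_id, projected_row) pairs whose row satisfies predicate."""
    out = []
    for rid, rec in records.items():
        row = project(rec, columns)
        if predicate(row):
            out.append((rid, row))
    return out


def checkpoint_ids(map_output):
    """Idea 2: checkpoint only the IDs of the surviving records."""
    return json.dumps([rid for rid, _ in map_output])


def checkpoint_raw(records, map_output):
    """Baseline for comparison: checkpoint the full raw records."""
    return json.dumps([records[rid] for rid, _ in map_output])


def restore(records, checkpoint, columns):
    """Rebuild the map output by looking the checkpointed IDs up again."""
    return [(rid, project(records[rid], columns)) for rid in json.loads(checkpoint)]


if __name__ == "__main__":
    needed = ["price"]
    output = map_filter(RECORDS, needed, lambda row: row["price"] > 150)
    ids = checkpoint_ids(output)
    raw = checkpoint_raw(RECORDS, output)
    print(len(ids), "bytes with IDs only vs", len(raw), "bytes with raw records")
    print(restore(RECORDS, ids, needed) == output)
```

The trade-off in this sketch is that recovery needs a second lookup of the original records by ID, so the smaller checkpoint is paid for with extra reads at restore time.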

Similar resources

From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra

MapReduce-based data processing platforms offer a promising approach for cost-effective and Web-scale processing of Semantic Web data. However, one major challenge is that this computational paradigm leads to high I/O and communication costs when processing tasks with several join operations typical in SPARQL queries. The goal of this demonstration is to show how a system RAPID+, an extension o...

Physical Data Warehouse Design on NoSQL Databases - OLAP Query Processing over HBase

Nowadays, data warehousing and online analytical processing (OLAP) are core technologies in business intelligence and therefore have drawn much interest by researchers in the last decade. However, these technologies have been mainly developed for relational database systems in centralized environments. In other words, these technologies have not been designed to be applied in scalable systems s...

Business Intelligence and Nosql Databases

NoSQL databases are becoming more and more popular, not only in typical Internet applications. They make it possible to store large volumes of data (so-called big data) while ensuring fast retrieval and fast appending. The main disadvantage of NoSQL databases is that they do not use the relational model of data and usually do not offer any declarative query language similar to SQL. This raises the question how No...

Cloud-Aware Processing of MapReduce-Based OLAP Applications

As the volume of data to be processed in a timely manner soars, the scale of computing and storage systems has much trouble keeping up with such a rate of explosive data growth. A hybrid cloud combining two or more clouds is emerging as an appealing alternative to expand local/private systems. However, the effective use of such an expanded cloud system is limited primarily by low network bandwi...

Efficient Multi-way Theta-Join Processing Using MapReduce

Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in real practices. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volum...

Journal:
  • IEICE Transactions

Volume 98-D  Issue 

Pages  -

Publication date 2015